Common formats

Sequence formats

fasta: plain text records that contain header and sequence (nucleic acid or protein)


In [1]:
cat ../data/rose.fa


>ROSE1
GCCGCCGAGAGGUGGCGUCCCCGACGCCUCAUGGGUCGGGAACGACUGAGACGGGCACCG
GUCGUGUCCGGUGCCGCUCGUAUCCAUUUUGCUCCUUGGAGGAUUUGGCUAUGCGCA

In [2]:
head ../data/contigs.fasta


>NODE_1667_length_78_cov_19440_ID_3333
AGTGACGCTAATGATGCCTACAATGCCCCGGAGACTGGGCTGTGTAGGTGCGTTCGCCTC
CAGCTATCATCGTCCGGG
>NODE_1664_length_78_cov_14158.5_ID_3327
GCTGCAGATGGCCGCCGCCTTCGAGCACTTCAACGAGGTGCATCCGCACTCGGCGCTGAA
GATGAAATCACCCGGAAA
>NODE_1663_length_79_cov_10190.5_ID_3325
CGAGGTGCATCCGCACTCGGCGCTGAAGATGAAATCACCCGGAGAGTTAAGGCAGCATCG
TGCCTCCCCACAGGGGAGA
>NODE_1661_length_80_cov_11498_ID_3321

fastq: plain text records that contain HTS sequence reads and associated quality scores


In [3]:
zcat ../data/BJ-HSR1_R1.fastq.gz | head


@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1
NCACTNCATCGAGTGCAAAAAGATGGCGGATGCACTGCCAACCAGCATGGCAATTGCAATGCAGATAGCACGCCAGCGACCTAGATAGGTCAGCGCAGCAAGCTGGTCATGGAGGTGGAAACGTGGCTTCATCTGTACAACGAGTTAGAT
+
#<<<.#FFFFFAFFFF.AA7FAF7AFFAFFF<FA<<A.F.A<FF.FA<FAF..F<AFF<AAF...<<.<F77A<<FFA<.AF.AFF.<.AF.7<7<<7<F<F7FF<A..F<F.<FA...<<<.7.F<FF7FA<A...<7.<<.7FF77.<
@NS500159:12:H2FJ5AFXX:1:11101:6058:1058 1:N:0:1
NCTAANGAATGATTGTAGTCCTCACACCACTCAAACGTCAGCTTATGGCGGCGGATTCAGCCGGTCGATGCAACACACTATGACACGGCCCAGGCCGAAGGAGTGCTGCAGATGGCCGCCGCCATCGAGAACATCCACGCGGAGCAACCC
+
#..7A#F<F7FFFF.AAF)FFFF.F)FF.7F.)<FFFFF)77F)..FFFF.FFFFFF<..F<F.)A<.FF..FFA.<F.AF7.A.7)<FAA.F7FA.AAF<..F7F<.FF..F7.FAFFAF.A.A7..AA...7<<))<F7F77A.)77)
@NS500159:12:H2FJ5AFXX:1:11101:24977:1060 1:N:0:1
NTGCANCACACTATGACACGGCCCAGGCCGAAGGAGTGCTGCAGATGGCCGCCGCCTTCGAGCACTTCAACGAGGTGCATCCGCACTCGGCGCTGAAGCTGAAATCACCCGGAGAGTTAAGGCAGCATCGTGCCTCCCCACAGGGGCGTTA

gzip: stdout: Broken pipe

Quality scores

$$Q_{phred}=-10log_{10}(e)$$

where:

  • $Q_{phred}$ - quality score
  • e - probability of a base being called wrong

How to encode it to text?

$Q_{phred} + 33$

LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL................................................. 
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
|                         |    |        |                              |                     |
33                        59   64       73                            104                   126
 0........................26...31.......40

@NS500159:12:H2FJ5AFXX:1:11101:12552:1058 1:N:0:1:

  • @NS500159 - machine id
  • 12 - run number
  • H2FJ5AFXX - flowcell id
  • 1 - lane
  • 11101 - tile number
  • 12552:1058 - x and y coordinates
  • 1 - read 1 or 2 (for paire ends)
  • N - filtered (Y) or not (N)
  • 0 - always 0 for HiSeq and NextSeq
  • 1 - sample no from the sample sheet

genebank: similar to fasta but allows annotations (metadata).

Old and busted. No standard, very buggy. Very bad format!

Alignment formats

sam plain text records that contain read and alignment info

Header lines start with @ and contain metadata: reference sequences names, lengths, aligner, etc.

Each alignment record contains 11 mandatory fields:

  • QNAME - query template name (think header from fastq file)
  • FLAG - bitwise flag (more on it in a moment)
  • RNAME - reference sequence name (e.g. chr1)
  • POS - 1-based left-most mapping position
  • MAPQ - mapping quality (think uniqueness of the mapping)
  • CIGAR - details of the mapping (match/mismatch/indel/clipping etc)
  • RNEXT - reference sequence name for the pair (mate)
  • PNEXT - mapping position for the pair (mate)
  • TLEN - template (query) length
  • SEQ - (aligned) segment sequence (not necessarily entire query sequence)
  • QUAL - quality, as in fastq

FLAG field

This is possibly the most important field in practical terms.

  • 1 0x1 template having multiple segments in sequencing
  • 2 0x2 each segment properly aligned according to the aligner
  • 4 0x4 segment unmapped
  • 8 0x8 next segment in the template unmapped
  • 16 0x10 SEQ being reverse complemented
  • 32 0x20 SEQ of the next segment in the template being reverse complemented
  • 64 0x40 the first segment in the template
  • 128 0x80 the last segment in the template
  • 256 0x100 secondary alignment
  • 512 0x200 not passing filters, such as platform/vendor quality controls
  • 1024 0x400 PCR or optical duplicate
  • 2048 0x800 supplementary alignment

bam binary (compressed) form of sam

Same as sam but compresses and therefore is not directly readable. But because of the compression efficiency, it is the preferred way of storing alignment data.

You don't usually work with these directly, rather they are produced as intermediate results that get processed further to yield biologically relevant insights.

These are result of any alignment to reference you perform.

pileup tab delimited; records contain aggregate alignment data per reference position

  • reference name
  • reference position
  • reference base
  • read depth at this position
  • read bases (total=read depth):
    • . match on the forward strand
    • , match on the reverse strand
    • ACTGN mismatch on forward strand
    • actgn mismatch on reverse strand
    • +|-[0-9]ACTGNactgn insertion | deletion
    • ^ start of the read segment
    • $ end of the read segment
  • read base qualities

Position (genomic interval) formats

  • gff (former gtf) genomic feature format; tab-delimited plain text
  • bed generic position format
  • vcf variant call format

Common tools

samtools

bedtools

bowtie/bowtie2

tuxedo suite

NCBI blast

Assemblers: velvet, trinity, SPades

Genome browsers

New hotness: kallisto/sleuth

Examples of pipelines/workflows

RNA-seq experiment

SHAPE-seq

Net-Seq


In [ ]: